I. INTRODUCTION


The technological advances in high-speed, general purpose processors have outpaced the 
support provided by main memory systems. In addition, software applications continue to grow in 
processor and memory requirements. The major factors in the design of memory systems are size 
of address space, bandwidth required, main-memory latency, and memory subsystem cost. Large 
memory subsystems use dynamic random-access memories because of their low cost per bit. 
Caching schemes, which employ high-cost, high-speed memories, are used to overcome main-memory
latency and increase bandwidth. However, main-memory latency, which is the time (in
processor cycles) between the start of a memory fetch and the start of the transfer of requested data, 
is significant and increasing [PRZYBY90]. Further gains in memory system performance are 
possible through the use of different manufacturing processes (CMOS, BiCMOS, ECL and GaAs) 
[VAGTS92] and stringent design of the memory hierarchy. One such memory performance 
enhancement is the prediction of a cache-miss read address request to main memory. If the read 
address is predicted and the data made available, then the overall system performance is improved. 

Since current RISC processors far exceed the capability of main memory systems, the focus 
for the computer systems architect is how to improve the performance of the memory hierarchy. 
Large, fully-associative caches are cost prohibitive, and direct-mapped caches offer an excellent
alternative [HILL88]. Direct-mapped caches, however, have a higher miss rate than fully-associative or
set-associative caches. A disadvantage of cache memories, in general, is the miss
penalty [PATHEN90], [PRZYBY90]. The reduction of the miss rate and subsequent miss penalty
is the motivation for the memory prediction buffer (MPB).

Conceptually, the MPB is an enhancement for the data cache. The behavior of processors
utilizing separate data and instruction caches is noted in this research and
others [JOUPPI90], [PRZYBY90]. Examination of this behavior shows that instruction caches and
data caches behave differently. Instruction caches can improve effectiveness by simply prefetching
the next instruction. This approach is shown to be less effective for data caches
[PATHEN90], [JOUPPI90]. If this approach is used for data cache management, it contributes to
pollution of the cache and increases the number of capacity misses. Since most modern RISC
processors have separate instruction and data caches, and employ some prefetch mechanism for the
instruction cache, this research focuses on improving the effectiveness of the data cache by
inserting an MPB between the cache and its refill line (main memory, in most cases). Although this
organization is the focus for this research, it is not the only implementation possible for the
MPB [NOWICKI92].


II. MEMORY HIERARCHY AND LATENCY REDUCTION


The von Neumann architecture, used by most single-instruction-single-data¹ (SISD) and
single-instruction-multiple-data (SIMD) machines, has some baseline behavioral characteristics to
consider [HWANG84]. The characteristics of the memory subsystem provide the parameters for
optimization of the operational behavior of the memory subsystem in conjunction with the
processor and secondary storage. First, stored programs obey the principle of locality
[PATHEN90]. This principle has two components which state that programs, while executing,
favor only a portion of their address space at a given instant. The two components are:


• Spatial Locality - Programs tend to request data and instructions that have memory
addresses near the instructions and data currently being used. The von Neumann
architecture provides for the execution of sequential program instructions and programs use
related data items which are likely to be adjacently stored.


• Temporal Locality - Programs tend to use current information and data. That is, if an item is
referenced, it will probably be referenced again soon. The older the information, the less
likely it is that the program will again reference it. Temporal locality is especially evident in
the execution of program loops where instructions and data are used several times within a
short period of time.


With reference to these principles, high-speed buffers are inserted between the main memory 
and the processor. These buffers are known as caches. The caches store portions of main memory 
which are currently in use by the executing program. This allows rapid access by the processor of 
the instructions and data needed to continue processing. Although the cache does a great job of 
hiding main memory latency, a disadvantage of its use is the penalty for a cache miss. The 
construction of the cache gives the following behavioral characteristics for a cache miss. 


• Compulsory - cache misses that occur when a block is first accessed and the program is just
starting. These are sometimes called cold start misses since the cache has never held the
information requested.


• Capacity - cache misses that occur when discarded blocks are again referenced by the
executing program. These misses are inevitable since the cache size is less than main
memory size.


• Conflict - the block placement strategy dictates conflict misses. Conflict misses occur when
a block is discarded because too many incoming blocks map to the same set and the
discarded block is soon needed. This characteristic is evident in both set-associative
and direct-mapped caches.


1. Flynn's classification (1966) is based on the multiplicity of instruction streams and data streams
in a computer system [HWANG84].
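The conflict-miss category can be illustrated with a small sketch. The function below is a hypothetical, minimal direct-mapped cache model (it is not the DINEROIII simulator used in this research); the 8K size and 32-byte block default to the model cache described later in this report. Two addresses exactly one cache size apart map to the same line, so alternating between them misses every time even though the cache is nearly empty.

```python
def direct_mapped_misses(addresses, cache_bytes=8192, block_bytes=32):
    """Count misses in a hypothetical direct-mapped cache (illustration only)."""
    num_lines = cache_bytes // block_bytes
    tags = [None] * num_lines          # one stored tag per cache line
    misses = 0
    for addr in addresses:
        block = addr // block_bytes    # block number in main memory
        index = block % num_lines      # the single line this block may occupy
        tag = block // num_lines       # distinguishes blocks sharing a line
        if tags[index] != tag:
            misses += 1                # compulsory or conflict miss
            tags[index] = tag          # the incoming block evicts the old one
    return misses


# Addresses 0 and 8192 share line 0, so they keep evicting each other.
conflicting = direct_mapped_misses([0, 8192, 0, 8192])
# Addresses 0 and 32 occupy different lines, so only the first touches miss.
friendly = direct_mapped_misses([0, 32, 0, 32])
print(conflicting, friendly)
```

In the first access pattern only the first two misses are compulsory; the remaining two are conflict misses, since neither block would have been discarded in a fully-associative cache of the same size.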


The structure of the memory subsystem is given in Figure 1. Traversing down the hierarchy,
access time increases and the storage size increases. However, bandwidth decreases significantly
while traversing the hierarchy, top to bottom. Some nominal figures for size and bandwidth are also
given in Figure 1. It is worth noting that each level contains only a subset of the information
contained in the next lower level. This presents a
constraint of maintaining coherency (correct information) throughout the hierarchy. The MPB
receives its information from the next lower level of the hierarchy. In this research, the next level
of the hierarchy is the main memory. For the development of the concept of the MPB and for most
of the simulation described here, the MPB is not involved in the write policy of the cache. The MPB
always gets its data from the main memory, which is kept up to date. Further research of the MPB
will study the implementation of a write-through policy for coherency. Write-back performance
will also be examined in follow-on research.

Figure 1: Memory Hierarchy


III. PERFORMANCE METRICS


In order to investigate the performance of the memory subsystem, characteristics of the memory
subsystem must be developed. From the system perspective, work completed in time defines
system performance. Hence, system performance can be described analytically as Equation 1.

$$\text{System Performance} = \frac{\text{Instructions Completed}}{\text{Elapsed Time}} \tag{1}$$

This definition of system performance yields the ubiquitous MIPS unit. This unit of
measurement should not be used to compare different systems performing the same task
[PATHEN90]. However, for characterization of a specific system performing the same task, this
unit of measure is useful. This measure of performance can be expressed in terms of processor
cycles. Efficiency is a product of the number of instructions executed, the number of clock cycles
per instruction and the clock speed (Equation 2).

$$E = I \cdot CPI \cdot f \tag{2}$$
Expanding this model, the number of cycles per instruction executed is the metric that is
directly influenced by the memory subsystem. Statistically, a more stable metric is the effective
CPI. The effective CPI is the statistical average of several measurements. The effective CPI is

$$CPI_{eff} = \frac{C_{eff}}{I} \tag{3}$$
The number of cycles per instruction is largely determined by processor architecture and
register/cache structure (effectiveness). With a focus toward the memory structure, the effective access
time of the memory subsystem is the best metric to indicate memory subsystem performance. This
parameter depends on the cache access time and the main memory access time. By decreasing the
number of cycles per instruction, the system performance is improved. The speedup in system
performance is modelled by Equation 4.

$$S = \frac{CPI_{eff} - CPI_{eff(MPB)}}{CPI_{eff}} = 1 - \frac{CPI_{eff(MPB)}}{CPI_{eff}} \tag{4}$$


The nominal figures for the number of cycles per instruction in high performance processors are
1.2-2.0 CPI. If we assume that the processor can execute instructions at the bandwidth of the
memory subsystem, the speedup becomes a function of the effective access time of the memory
subsystem. Equation 5 determines the speedup of a given system by reference to the effective access
time with the MPB and without the MPB.

$$S = 1 - \frac{T_{ea(MPB)}}{T_{ea}} \tag{5}$$
The effective access time measures the memory hierarchy performance. The effective access
time is, therefore, a function of the cache performance and main memory performance as noted in
Equation 6.

$$T_{ea} = T_s + C_{hr} \cdot T_{cf} + (1 - C_{hr}) \cdot (T_s + T_m + T_{cf}) \tag{6}$$

This relationship can be simplified by noting that the time for a cache tag search, $T_s$, is very small. In
addition, the cache tag search and cache fetch are much smaller than the time to read/fetch data
from main memory, $T_m$. The effective access time can then be approximated as in Equation 7.

$$T_{ea} \approx C_{hr} \cdot T_{cf} + (1 - C_{hr}) \cdot T_m \tag{7}$$

This approximation can be used only for comparison between simulation models. The
description given by Equation 6 must be used for evaluation of the simulation model with respect to
implementation performance.
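As a rough numerical illustration of Equations 5 and 7, the sketch below computes the approximate effective access time as the hit ratio times the cache fetch time plus the miss ratio times the main memory fetch time, and then the speedup. The function names and the two hit ratios (0.90 without the MPB, 0.92 with it) are hypothetical values chosen for illustration, not results from this research; the 10 ns cache and 80 ns main memory figures are those of the model system described later in this report.

```python
def effective_access_time(hit_ratio, t_cache_ns=10.0, t_memory_ns=80.0):
    """Equation 7 approximation: hits cost a cache fetch, misses a memory fetch."""
    return hit_ratio * t_cache_ns + (1.0 - hit_ratio) * t_memory_ns


def speedup(t_ea_base, t_ea_mpb):
    """Equation 5: one minus the ratio of access time with and without the MPB."""
    return 1.0 - t_ea_mpb / t_ea_base


# Hypothetical hit ratios, for illustration only.
t_base = effective_access_time(0.90)
t_mpb = effective_access_time(0.92)
print(f"T_ea = {t_base:.1f} ns, T_ea(MPB) = {t_mpb:.1f} ns, "
      f"S = {speedup(t_base, t_mpb):.3f}")
```

Because Equation 7 drops the tag search time and the cache fetch on a miss, figures produced this way are suitable only for comparing models against one another, as noted above.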


IV. MEMORY PREDICTION BUFFER 


The memory prediction buffer (MPB) was conceived to predict the next cache-miss address
and prefetch the data before the request is made by the processor. The MPB can be inserted between
the cache and its refill line as depicted in Figure 2. Another possible configuration could be the use
of smaller MPBs attached to individual memory chips (DRAMs). This implementation is realized
in recent work by Nowicki [NOWICKI92]. A block diagram of this approach is given in Figure 3.

Figure 2: MPB With Cache Implementation

Figure 3: MPB With Main Memory Implementation

In the early research of this idea, efforts turned instinctively toward statistical methods for prediction.
The area of digital signal processing was explored for possible solutions to the prediction
requirement [HAMMINS83], [THERRI92]. Kalman filters, Wiener filters and other adaptive
techniques for prediction were proposed and investigated. However, further characterization of the
problem provided more specifications for possible solutions.


Cache simulation was achieved using Mark Hill's DINEROIII cache simulator. The model
cache is direct-mapped, with 8K data and 8K instruction caches and a 32-byte line size. Using various ATUM
traces [GRIMSR92] and DEC traces [BORG90], cache miss addresses were
investigated [AGARWL86]. Review of the traces shows that spatial locality and temporal locality are
valid for all processes. Since no curves are noted in the traces, prediction should employ linear
methods. The physical construction of the memory prediction buffer is given in Figure 4.

Figure 4: Memory Prediction Buffer

The simulation was configured to give the number of cache hits before a miss is encountered. The
average of these counts gives the constraint on the time available to predict and prefetch a miss
address. Since the average number of cache hits before a cache miss is 4-6, it is possible that some 6-10
cycles are available for prediction and prefetch. In addition, the system bus bandwidth must be
considered for any prefetch solution. These constraints were responsible for the development of a
simpler prediction algorithm. The prediction algorithm yields a bias for the ensuing prefetch. The
algorithm is implemented in C for simulation.
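The hits-before-a-miss measurement described above can be sketched as follows. This is a hypothetical helper, not part of DINEROIII or of the simulation code used in this research; it assumes the simulator can emit a per-access sequence of hit (True) and miss (False) outcomes, and the short outcome list is made-up data for illustration.

```python
def hits_before_each_miss(outcomes):
    """Return, for every miss, the number of consecutive hits that preceded it."""
    runs = []
    run = 0
    for hit in outcomes:
        if hit:
            run += 1          # extend the current run of hits
        else:
            runs.append(run)  # a miss closes the run
            run = 0
    return runs


def average_hits_before_miss(outcomes):
    """Average run length; None when the trace contains no misses."""
    runs = hits_before_each_miss(outcomes)
    return sum(runs) / len(runs) if runs else None


# Made-up outcome sequence: True is a cache hit, False is a cache miss.
sample = [True, True, False, True, False, False]
print(hits_before_each_miss(sample), average_hits_before_miss(sample))
```

The average run length bounds the number of processor cycles in which a prediction and prefetch must complete if the predicted block is to be ready before the next miss.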
If the current address is larger than the past address, then the bias is positive (negative
otherwise). The algorithm for the MPB is given in Figure 5.

Figure 5: Memory Prediction Buffer Algorithm

The determination and application of the bias is central to the algorithm. The bias is simply the
difference in address boundaries (if word aligned) of the previous address and the current address.
If the address requested is more than 32K away, another address stream bias is established. The
corresponding address stream bias is used to predict the next requested address. The bias may be
positive or negative, that is, ascending or descending in memory. The correct address stream bias is
determined using a simple but fast binary search. The search time can be reduced further using a
fully associative search.
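The bias mechanism described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Figure 5 or the C implementation used in this research: the class and method names are hypothetical, addresses are aligned to 32-byte blocks, a miss within 32K of a stream's last address is taken to belong to that stream, and streams are found by a plain linear scan rather than the binary search mentioned in the text.

```python
BLOCK_BYTES = 32              # block size, matching the 32-byte lines in the text
STREAM_SPAN = 32 * 1024       # a miss farther than 32K away starts a new stream


class BiasPredictor:
    """Hypothetical sketch of bias-based prediction of the next miss address."""

    def __init__(self):
        self.streams = []     # each stream is [last_block_address, bias]

    def observe(self, miss_addr):
        """Record a miss address; return the predicted next miss, or None."""
        addr = miss_addr - (miss_addr % BLOCK_BYTES)   # align to a block boundary
        for stream in self.streams:                    # linear scan (assumption)
            last = stream[0]
            if abs(addr - last) <= STREAM_SPAN:
                stream[1] = addr - last                # positive or negative bias
                stream[0] = addr
                return addr + stream[1] if stream[1] != 0 else None
        self.streams.append([addr, 0])                 # establish a new stream
        return None


predictor = BiasPredictor()
for miss in (0x1000, 0x1020, 0x1000, 0x100000, 0x100040):
    print(hex(miss), predictor.observe(miss))
```

The first miss of a stream yields no prediction because no bias exists yet. An ascending pair of misses produces a positive bias, a descending pair a negative one, which is what lets this scheme follow address streams in either direction.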

The structure of the memory prediction buffer is similar to a conventional fully-associative
cache. The MPB is composed of m lines of n-byte blocks. For the cache used in this research, the
MPB has 16-256 lines of 32-byte blocks. The blocks are aligned on the same address (word)
boundaries as the first level cache. The block size is dependent on the block size of the first level
cache. The optimal size of the MPB is 64-256 lines. This size is due to the fan-out requirements (and
costs) for the construction of a fully associative cache and the number of lines (sets) needed to allow
effective use of the replacement policy used (random replacement rather than LRU, FIFO, etc.). If an LRU
replacement policy is used instead of random replacement, a smaller MPB can be used to give the
same performance improvement.
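The buffer structure described above can be sketched as a small fully-associative store of block addresses. The class below is a hypothetical illustration, not the simulated MPB of this research: it uses LRU replacement (the simulations reported here mostly used random replacement), tracks only block addresses rather than data, and its names and the 64-line default are assumptions.

```python
from collections import OrderedDict


class PredictionBuffer:
    """Hypothetical fully-associative buffer of prefetched blocks with LRU eviction."""

    def __init__(self, num_lines=64, block_bytes=32):
        self.num_lines = num_lines
        self.block_bytes = block_bytes
        self.lines = OrderedDict()      # block address -> None, oldest entry first

    def _block(self, addr):
        return addr - (addr % self.block_bytes)   # same alignment as the cache

    def prefetch(self, addr):
        """Install the block holding addr, evicting the least recently used line."""
        block = self._block(addr)
        if block in self.lines:
            self.lines.move_to_end(block)         # refresh an existing line
            return
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)        # drop the least recently used
        self.lines[block] = None

    def lookup(self, addr):
        """Return True when a cache miss on addr can be served from the buffer."""
        block = self._block(addr)
        if block in self.lines:
            self.lines.move_to_end(block)         # mark as most recently used
            return True
        return False
```

Any address can occupy any line, so a lookup compares against every stored block address; this is the source of the fan-out cost that limits the practical number of lines.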
V. MEMORY PREDICTION BUFFER PERFORMANCE


A. MPB THEORETICAL PERFORMANCE

The memory prediction buffer determines the future cache miss address using previous cache
miss addresses. For this analysis, only the data cache is given an MPB. The instruction cache is set
to prefetch instructions. Given a model cache with a hit ratio of 93.2%, if the MPB is found to be
correct on 33% of its predictions, an increase of 2.1 percentage points is realized for the cache hit
rate. The effective cache hit ratio is improved from 93.2% to 95.3%.

Figure 6: MPB Performance Graph

The graph of Figure 6 gives the effective cache hit rate as a function of MPB effectiveness. There are
four cache models that are compared. One
model has an 80% initial hit rate, another model has an 85% hit rate and so on. A sample reading is
shown for a base cache hit ratio of 80% with an MPB effectiveness rating of 20%. The resulting
effective cache hit ratio for this sample is 84%. This is an increase of 4 percentage points in the
effective cache hit ratio. The resulting system performance achieves a speedup of 9%.
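The sample reading above is consistent with a simple model in which the MPB converts a fraction of the cache misses into hits. The sketch below states that model explicitly; the formula is inferred from the 80% / 20% / 84% sample reading rather than quoted from this report, and the function name is hypothetical.

```python
def effective_hit_ratio(base_hit_ratio, mpb_effectiveness):
    """Assumed model: the MPB turns a fraction of the misses into hits."""
    miss_ratio = 1.0 - base_hit_ratio
    return base_hit_ratio + mpb_effectiveness * miss_ratio


# Sample reading from the text: 80% base hit ratio, 20% MPB effectiveness.
print(f"{effective_hit_ratio(0.80, 0.20):.2f}")
```

Under this model the gain shrinks as the base hit ratio rises, because fewer misses remain for the MPB to capture.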

The model system for this investigation has 10 ns cache memory and 80 ns main memory. This
model memory hierarchy is also used by the simulation study. The cycle time of the main memory
is not considered but would add to the effectiveness of the MPB.
B. BASELINE SYSTEM PERFORMANCE

In order to compare the performance of the MPB to existing latency reduction strategies,
several measurements of the baseline system had to be collected and examined. This baseline
system was constructed using the cache simulator, DINEROIII. The system simulates separate 8K
direct-mapped data and 8K direct-mapped instruction caches.


Table 1: BASELINE SYSTEM PERFORMANCE 
C. MPB SIMULATION PERFORMANCE










The theoretical study of the MPB was realized using trace-driven
simulation (TDS) [GRIMSR92] with the DINEROIII cache simulator (provided by Mark Hill). As
with any TDS research, address traces and their accuracy are critical to proper simulation. For this
research, ATUM traces [AGARWL86] and DEC Titan traces [BORG90] were used. Some
behavioral characteristics of the simulation are graphically illustrated in the appendix.

Table 2: MEMORY PREDICTION BUFFER PERFORMANCE (DEC)

Table 2 gives a summary of MPB performance for two processes and two runs of each. SOR is Renato
De Leone's successive over-relaxation algorithm that uses sparse matrices. TREE is Joel Bartlett's program
which builds a tree data structure and searches for the largest element in the tree. His program is a
variant of LISP. Both of these process traces were provided by DEC WRL. The model system is a
RISC processor with separate 8K instruction and 8K data caches. There are 32-byte blocks in the
cache and in the MPB. The cache is direct-mapped for reasons given by [HILL88]. CHR is the initial
cache hit rate, measured before the insertion of the MPB. The local hit rate for the MPB is given under
MHR. The overall hit rate for the cache and MPB combined is listed under NHR. The speedup is
listed for the overall system. For these examples, the MPB consists of 128 lines of 32-byte
blocks. Each line is boundary aligned in the same way as the cache. That is, just
as the cache may use word-aligned blocks, so does the MPB. This MPB simulation used a random
replacement policy for the removal of lines. Toward the end of this research effort, an MPB was
simulated using a least-recently-used (LRU) replacement policy. Several simulations using this
replacement policy showed that the number of lines in the MPB could be reduced while maintaining
the effectiveness of the MPB. In particular, 64 lines were shown to perform nearly as well as 128
lines. For the simulation results of Table 2, the speedup numbers are modest, but the cost of this
implementation is minimal when compared to a 256K next-level cache [PATHEN90].

In addition to the simulations using the DEC traces, simulations were also done using ATUM
traces. Table 3 lists results of simulation using ATUM traces. The model system is the same as used
in the DEC trace simulation. These simulation results can be used to motivate further research.

Table 3: MEMORY PREDICTION BUFFER PERFORMANCE (ATUM)









ATUM traces are relatively short for cache modelling and behavior analysis. Each trace is
approximately 400,000 addresses. This number of addresses is marginally adequate for a 32K cache
simulation, and larger cache-size simulations would require a larger number of addresses for proper
and accurate simulation.

For the preceding research, a random-replacement policy was used by the MPB. An early
implementation of the MPB using a least-recently-used (LRU) policy shows improved performance
over the random-replacement algorithm. Table 4 lists the results of this research using the process
"tree". Results of this implementation using other processes were not yet available at the time
of the report. As evidenced by all these simulation studies, the MPB is shown to be a favorable
architectural concept for consideration in systems where the highest possible performance is desired
and system costs are constrained.

Table 4: MEMORY PREDICTION BUFFER PERFORMANCE (LRU)




VI. CONCLUSIONS 


The memory prediction buffer is proposed as a component for high performance computer
systems. The widening gap between processor speed and memory subsystems requires the
investigation of alternative architectures for reducing main memory latency while restraining costs.
The MPB outperforms prefetch-always strategies by allowing addressing in the up and down
direction. In addition, the MPB does not contribute to pollution of the cache. Effective memory
latency reduction must be addressed at the time of system design. In addition, as the requirements
for a larger address space grow, memory hierarchy design and implementation will continue to
increase in complexity. The implementation of an MPB is less expensive than a next-level cache and
delivers a comparable performance enhancement. In addition, the algorithm used can be tailored to
the proposed system environment to provide a more effective latency reduction structure. The MPB
is shown to improve overall system performance and provide reasonable gains in speedup.
VII. RECOMMENDATIONS FOR FUTURE RESEARCH


The memory prediction buffer is studied and simulated for enhancement of the data cache of
a uniprocessor. Its use or enhancement in a multiprocessor environment is not yet known. In
addition, the question of whether the MPB can be used to significantly enhance the performance of
the instruction cache has not been fully explored. The algorithm for the MPB of this research
focused on a random replacement policy for discarding lines. The LRU replacement policy showed
an improvement over random; however, the effect of other replacement policies remains open for
study. Simulation and study of the memory bandwidth required to support an architecture with
an MPB and without an MPB is needed. A comparison of the amount of bandwidth required by the
base architecture (cache and processor) with the bandwidth required by the architecture with an MPB
installed would be useful. The cache write-back policy and its effect on system performance with and
without an MPB is an area open for study.